# Diffusion‑Aligned Embeddings (DAE)

This repository contains a reference implementation of the Diffusion‑Aligned
Embeddings (DAE) algorithm.  DAE is a method for constructing weighted
affinity graphs with self‑tuning bandwidths and for computing embeddings via
continuous‑time Markov chain (CTMC) dynamics.  It supports a variety of
distance metrics, kernel families and k‑NN backends, and can be accelerated
with optional libraries such as FAISS, hnswlib, Annoy or pynndescent.

## Features

* Build a weighted k‑nearest neighbor graph with per‑sample bandwidths
  following the UMAP self‑tuning procedure.
* Choose from several symmetrization rules (mean, max, min, geometric,
  harmonic, UMAP) to fuse directed edge weights into an undirected graph.
* Use fast k‑NN search backends including FAISS, hnswlib, Annoy or
  pynndescent when available, or fall back to scikit‑learn.
* Select distance metrics (Euclidean, cosine, Mahalanobis) and register
  custom metrics at runtime.
* Choose kernel families (UMAP heavy‑tail, Student‑t, exponential, power‑law
  and more) or provide your own numba‑compiled kernel.
* Compute diffusion embeddings with a continuous‑time Markov chain optimizer
  (see `ctmc_engine.py`).
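
The symmetrization rules listed above can be illustrated with a small NumPy sketch. This is not the package's implementation: the function name `symmetrize`, the dense-matrix representation and the convention that a missing directed edge has weight 0 are all assumptions made for illustration, and the weights are assumed to lie in [0, 1] so that the UMAP rule can be read as a probabilistic union.

```python
import numpy as np

def symmetrize(W, rule="umap"):
    """Fuse a dense directed weight matrix (W[i, j] = weight of edge i -> j)
    into a symmetric one. Hypothetical helper, not part of the DAE API."""
    W = np.asarray(W, dtype=np.float64)
    Wt = W.T
    if rule == "mean":
        return 0.5 * (W + Wt)
    if rule == "max":
        return np.maximum(W, Wt)
    if rule == "min":
        return np.minimum(W, Wt)
    if rule == "geometric":
        return np.sqrt(W * Wt)
    if rule == "harmonic":
        denom = W + Wt
        out = np.zeros_like(W)
        # Leave entries at 0 where both directed weights are 0.
        np.divide(2.0 * W * Wt, denom, out=out, where=denom > 0)
        return out
    if rule == "umap":
        # Probabilistic (fuzzy) union: a + b - a * b.
        return W + Wt - W * Wt
    raise ValueError(f"unknown rule: {rule!r}")

# Toy directed graph on three nodes (assumed weights in [0, 1]).
W = np.array([[0.0, 0.8, 0.0],
              [0.2, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
S = symmetrize(W, "umap")
```

Under these assumptions, `min`, `geometric` and `harmonic` drop any edge that exists in only one direction, while `mean`, `max` and `umap` keep it.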

## Installation

Install the core package from source using pip:

```bash
pip install .
```

The core dependencies are `numpy`, `scipy`, `numba` and `scikit-learn`.  To
enable optional k‑NN backends or symbolic kernel expressions, install one or
more of the extra sets:

```bash
pip install .[faiss]        # FAISS accelerated k‑NN
pip install .[hnswlib]      # hnswlib accelerated k‑NN
pip install .[annoy]        # Annoy accelerated k‑NN
pip install .[pynndescent]  # pynndescent accelerated k‑NN
pip install .[sympy]        # symbolic kernel expressions
```
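
To see which optional dependencies are importable in the current environment, a check along the following lines may help. It is a sketch that uses only the standard library; the helper name `available_optional_modules` is hypothetical, and the import names are assumed to match the extras listed above.

```python
import importlib.util

# Import names of the optional dependencies (assumed to match the extras).
OPTIONAL_MODULES = ["faiss", "hnswlib", "annoy", "pynndescent", "sympy"]

def available_optional_modules(names=OPTIONAL_MODULES):
    """Return the subset of `names` that can be located on the import path."""
    return [n for n in names if importlib.util.find_spec(n) is not None]

available = available_optional_modules()
print("Optional modules found:", available or "none")
```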

## Quick start

Below is a minimal example that computes a DAE embedding of synthetic
data:

```python
import numpy as np
from dae.ctmc_engine import CTMCEmbedding

# Generate synthetic data: 100 samples with 5 features each
X = np.random.randn(100, 5).astype(np.float32)

# Compute the embedding (see the CTMCEmbedding docstring for available options)
Y = CTMCEmbedding().fit_transform(X, n_epochs=500)
```

See the docstrings in `ctmc_engine.py` for descriptions of all parameters and return values.

## Citing

If you use this code in your research, please cite the DAE paper (insert
appropriate citation here).

## License

This project is licensed under the MIT License.  See the `LICENSE` file for
details.

## Reproducing the CPU experiments

To reproduce the results, upload the notebook to Google Colab and run it. The datasets can be obtained from `datasets.base.py`.
A dataframe containing the results of the ablations is also available. To use the package more generally, run `pip install .`.

## Large‑scale experiments (see zip)

Steps:
  - Run highly variable gene (HVG) selection with `run_hvg.py`.
  - Download tissue slices with `download_slices.py`, which fetches the filtered datasets (slices) for the selected tissues.
  - Train scVI with `compute_scvi.py` to learn latent representations of the datasets.
  - Compute embeddings on the scVI latent space with `compute_embeddings.py`.
  - Plot the embeddings with `plot_embeddings.py`, colored by cell type.
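
The steps above can be chained with a small driver script such as the sketch below. The script names are taken from the list; everything else is an assumption. In particular, the scripts are invoked without arguments, which may not match their actual command-line interfaces, and `run_pipeline` and `build_commands` are hypothetical helpers. By default the sketch only prints the commands; pass `dry_run=False` to execute them.

```python
import subprocess
import sys

# Script names come from the steps above. Running them without arguments
# is an assumption; adjust the commands to each script's actual CLI.
PIPELINE = [
    "run_hvg.py",
    "download_slices.py",
    "compute_scvi.py",
    "compute_embeddings.py",
    "plot_embeddings.py",
]

def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one command (as an argument list) per pipeline script."""
    return [[python, script] for script in scripts]

def run_pipeline(scripts=PIPELINE, dry_run=True):
    """Print each command in order and, unless dry_run is set, execute it."""
    commands = build_commands(scripts)
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)
    return commands

commands = run_pipeline(dry_run=True)
```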

